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Abstract 

We describe a dynamic programming algorithm for predicting opti- 
mal RNA secondary structure, including pseudoknots. The algorithm 
has a worst case complexity of 0{N^) in time and 0{N''') in storage. 
The description of the algorithm is complex, which led us to adopt a use- 
ful graphical representation (Feynman diagrams) borrowed from quantum 
field theory. We present an implementation of the algorithm that gener- 
ates the optimal minimum energy structure for a single RNA sequence, 
using standard RNA folding thermodynamic parameters augmented by a 
few parameters describing the thermodynamic stability of pseudoknots. 
We demonstrate the properties of the algorithm by using it to predict 
structures for several small pseudoknotted and non-pseudoknotted RNAs. 
Although the time and memory demands of the algorithm are steep, we 
believe this is the first algorithm to be able to fold optimal (minimum 
energy) pseudoknotted RNAs with the accepted RNA thermodynamic 
model. 
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1 INTRODUCTION 



Many RNAs fold into structures that are important for regulatory, catalytic, or 
structural roles in the cell. An RNA's structure is dominated by base pairing in- 
teractions, most of which are Watson-Crick pairs between complementary bases. 
The base paired structure of an RNA is called its secondary structure. Because 
Watson-Crick pairs are such a stereotyped and relatively simple interaction, 
accurate RNA secondary structure prediction appears to be an achievable goal. 

A rather reliable approach for RNA structure prediction is comparative se- 
quence analysis, in which covarying residues (e.g. compensatory mutations) 
are identified in a multiple sequence alignment of RNAs with similar structures 
but different sequences (Woese & Pace, 1993). Covarying residues, particularly 
pairs which covary to maintain Watson-Crick complementarity, are indicative 
of conserved base pairing interactions. The accepted secondary structures of 
most structural and catalytic RNAs were generated by comparative sequence 
analysis. 

If one has only a single RNA sequence (or a small family of RNAs with little 
sequence diversity) comparative sequence analysis cannot be applied. Here the 
best current approaches are energy minimization algorithms (Schuster et al, 
1997). While not as accurate as comparative sequence analysis, these algorithms 
have still proven to be useful research tools. Thermodynamic parameters are 
available for predicting the AG of a given RNA structure (Freier et ai, 1986; 
Serra et ai, 1995). The Zuker algorithm (implemented in the programs MFOLD 
(Zuker, 1989a) and ViennaRNA (Schuster et ai, 1994)) is an efficient dynamic 
programming algorithm for identifying the globally minimal energy structure for 
a sequence, as defined by such a thermodynamic model (Zuker & Stiegler, 1981; 
Zuker & Sankoff, 1984; Sankoff, 1985). The Zuker algorithm requires 0{N^) 
time and 0{N'^) space for a sequence of length iV, and so is reasonably efficient 
and practical even for large RNA sequences. The Zuker dynamic programming 
algorithm was subsequently extended to allow experimental constraints, and 
to sample suboptimal folds (Zuker, 1989b). McCaskill's variant of the Zuker 
algorithm calculates probabilities (confidence estimates) for particular base pairs 
(McCaskill, 1990). 

One well-known limitation of the Zuker algorithm is that it is incapable of 
predicting so-called RNA pseudoknots. This is the problem that we address in 
this paper. 

The thermodynamic model for non-pseudoknotted RNA secondary struc- 
ture includes some stereotypical interactions, such as stacked base-paired stems, 
hairpins, bulges, internal loops, and multiloops. Formally, non pseudoknotted 
structures obey a "nesting" convention: that for any two base pairs i,j and fc, I 
(where i < j and k < I), either i<k<l<j or k<i<j<l. It is precisely 
this "nesting" convention that the Zuker dynamic programming algorithm re- 
lies upon to recursively calculate the minimal energy structure on progressively 
longer subsequences. An RNA pseudoknot is defined as a structure contain- 
ing base pairs which violate the nesting convention. An example of a simple 
pseudoknot is shown in Figure 0. 

RNA pseudoknots are functionally important in several known RNAs (ten 
Dam et ai, 1992). For example, by comparative analysis, RNA pseudoknots are 
conserved in ribosomal RNAs, the catalytic core of group I introns, and RNase P 
RNAs. Plausible pseudoknotted structures have been proposed (Florentz et al., 
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Figure 1: A simple pseudoknot. In a pseudoknot, nucleotides inside a hairpin 
loop pair with nucleotides outside the stem-loop. 

1982), and recently confirmed (Kolk et a/., 1998) for the 3' end of several plant 
viral RNAs, where pseudoknots are apparently used to mimic tRNA structure. 
In vitro RNA evolution (SELEX) experiments have yielded families of RNA 
structures which appear to share a common pscudoknottcd structure, such as 
RNA ligands selected to bind HIV-1 reverse transcriptase (Tuerk et al., 1992). 

Most methods for RNA folding which are capable of folding pseudoknots 
adopt heuristic search procedures and sacrifice optimality. Examples of these 
approaches include quasi-Monte Carlo searches (Abrahams et al., 1990) and 
genetic algorithms (Gultyaev et al., 1995; van Batenburg et al., 1995). These 
approaches arc inherently unable to guarantee that they have found the "best" 
structure given the thermodynamic model, and consequently unable to say how 
far a given prediction is from optimality. 

A different approach to pseudoknot prediction based on the maximum weigh- 
ted matching (MWM) algorithm (Edmonds, 1965; Gabow, 1976) was introduced 
by Gary and Stormo (1995). Using the MWM algorithm, an optimal structure 
is found, even in the presence of complicated knotted interactions, in 0{N^) 
time and 0{N'^) space. However, MWM seems best suited to folding sequences 
for which a previous alignment exists. In the scoring system used by Gary and 
Stormo, weights are assigned by comparative analysis. It is not clear to us that 
the MWM algorithm will be amenable to folding single sequences or collections 
of sequences which present little variation respect to each other. However, we 
believe that this was the first work that indicated that optimal RNA pseudoknot 
predictions can be made with polynomial time algorithms. It had been widely 
believed, but never proven, that pseudoknot prediction would be an NP problem 
(NP = nondeterministic polynomial; e.g. only solvable by heuristic or brute 
force approaches). 

In this paper we describe a dynamic programming algorithm which finds 

optimal pscudoknottcd RNA structures. We describe the algorithm using a 
diagrammatic representation borrowed from quantum field theory (Feynman 
diagrams). We implement a version of the algorithm that finds minimal energy 
RNA structures using the standard RNA secondary structure thermodynamic 
model (Freier et al., 1986, Serra et al., 1995), augmented by a few pseudoknot- 
specific parameters that are not yet available in the standard folding parameters, 
and by coaxial stacking energies (Walter et al., 1994) for both pscudoknottcd 
and non-pseudoknotted structures. We demonstrate the properties of the algo- 



2 





ss 



stem hairpin 



hairpin 



bulge 



Figure 2: Diagrammatic representation of RNA most relevant secondary struc- 
tures, including a pseudoknot. 



rithm by testing it on several small RNA structures, including both structures 
thought to contain pseudoknots and structures thought not to contain pseudo- 
knots. 



In this section we will introduce a diagrammatic way of representing RNA fold- 
ing algorithms. We will start by describing the Nussinov algorithm (Nussinov 
et aL, 1978), and the Zuker-Sankoff algorithm (Zukcr & Sankoff 1984; Sankoff 
1985) in the context of this representation. Later on we will extend the dia- 
grammatic representation to include pseudoknots and coaxial stackings. The 
Nussinov and Zuker-Sankoff algorithms can be implemented without the dia- 
grammatic representation, but this representation is essential to manage the 
complexity introduced by pseudoknots. 

2.1 Preliminaries 

From here on, unless otherwise stated, a flat solid line will represent the back- 
bone of a RNA sequence with its 5' end placed in the left hand side of the 
segment. N will represent the length (in number of nucleotides) of the RNA. 

Secondary interactions will be represented by wavy lines connecting the two 
interacting positions in the backbone chain, while the backbone itself always 
remains flat. No more than two bases arc allowed to interact at once. This 
representation does not provide insight about real (3D) spatial arrangements, 
but is very convenient for algorithmic purposes. When necessary for clarifica- 
tion single stranded regions will be marked by dots, but when unambiguous. 
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dots will be omitted for simplicity. Using this representation (figure g) we can 
describe hairpins, bulges, stems, internal loops and multiloops as simple nested 
structures; a pseudoknot, on the other hand, corresponds to a non-nested struc- 
ture. 

2.2 Diagrammatic representation of nested algorithms 

In order to describe a nested algorithm we need to introduce two triangular 
N X N matrices, to be called vx and wx. These matrices are defined in the 
following way: vx(i,j) is the score of the best folding between positions i and 
j provided that i and j are paired to each other; whereas 'wx{i,j) is the score 
of the best folding between positions i and j regardless of whether i and j pair 
to each other or not. These matrices are graphically represented in the form 
indicated in fig. ^. The filled inner space indicates that we do not know how 
many interactions (if any) occur for the nucleotides inside, in contrast with 
a blank inner space which indicates that the fragment inside is known to be 
single stranded. The wavy line in vx, as previously, indicates that i and j are 
definitely paired, and similarly the discontinuous line in wx indicates that the 
relation between i and j is unknown. Also part of our convention is that for a 
given fragment, nucleotide i is at the 5'-end, and nucleotide j is at the 3'-end, 
so that i < j. 



i vx j 

Figure 3: Wx and vx matrices. 

The purpose of a dynamic programming algorithm is to fill the vx and wx 
matrices with appropriate numerical weights by means of some sort of recursive 
calculation. 

The recursion relations used to fill the wx matrix include: single-stranded 
nucleotides, external pairs, external dangling bases, and bifurcations. The ac- 
tual recursion is easier to understand by looking at the diagrams involved (given 
in fig. 0) and the recursion can be expressed as, 
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Figure 4: Recursion for wx in the nested algorithm 
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(1) 

Each line gives the formal score of one of the diagrams in fig. ^. The diagram 
on the left is calculated as the score of the best diagram on the right. Here P is 
some value that represents the score for a base pair. Q represents the score for 
a single stranded nucleotide, whereas i^+i^j and R\ j^i stand for the score of 
a nucleotide dangling off a base pair of nucleotides, at the 5'-end or the 3'-end 
respectively. 

The recursion for vx includes hairpins, bulges, internal loops, and multiloops. 
But what is special about hairpins, bulges, internal loops, and multiloops in this 
diagrammatic representation? To answer this question we have to introduce two 
more definitions: Surfaces (S) and Irreducible Surfaces (IS). 

Roughly speaking a Surface is any alternating sequence of solid and wavy 
lines that closes on itself. An Irreducible Surface is a Surface such that if one 
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of the H-bonds (or secondary interactions) is broken there is no other surface 
contained inside, that is, an IS cannot be "reduced" to any other surface. (ISs 
are similar to the "k-loops" defined by Sankoff (1985).) The order, O, of an IS 
is given by the number of wavy lines (secondary interactions) , which is equal to 
the number of solid-line intervals. It is easy to see that hairpin loops constitute 
the ISs of 0(1); stems, bulges and internal loops are all the ISs of 0(2), and 
what are referred to in the literature as "multiloops" are the ISs of O > 2. 
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Figure 5: General recursion for vx in the nested algorithm. 
The actual recursion for vx is given in fig. ^ and can be expressed as, 



vx(i,j) — optimal < 



IS\i,j:k,l) 



k,l 
kj 



+vx{m, n) 
C'(/S'5) 



-f vx{k, I) 

m, n) + vx{k, I) + vx{m, n) 
m,n : r, s) + vx{k, I) 
vx{r, s) 



(2) 



[Vfc, l,m,n,r, s, i<k<l<r<s<m<n<j] 

This recursion is an expansion in ISs of successively higher order. Here /S*" (ii , ji : 
12,32 ■ in,jn) represent the score for an IS of order n, in which is paired to 
jfc. This general algorithm is quite impractical, because each IS of 0(7) adds a 
complexity of 0{N'^'^) to the calculation. [An IS of 0{N'^'^) requires us to search 
through 27 independent segments in the entire sequence of N nucleotides.] To 
make it useful we have to truncate the expansion in IS's at some order in the 
recursion for vx in fig. ||. The symbol 0{IS^) indicates the order of IS at which 
we truncate the recursion. (Note that the recursion for vox will remain always 
the same.) 

These recursions are equivalent to those proposed by Sankoff (1985) in The- 
orem 2. Notice also that in defining the recursive algorithm we have not yet 
had to specify anything about the particular manner in which the contribution 
from different IS's are calculated in order to obtain the most optimal folding. 

The simplest truncation is to stop at order zero. In this approximation none 
of the ISs (hairpin, bulge, internal loop...) are given any specialized scores. We 
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Figure 6: Recursion for vx truncated at 0(0) 

only have to provide a specific score for a base pair, B. The recursion for vx 
then simphfies to fig. ^ and can be cast into the form, 

vx{i,i) — B + wx(i + 1, j — 1). (3) 

If we set i? = P = +1, and Q = in equation (|^) then we have the Nussinov 
algorithm (Nussinov et ai, 1978). This simple algorithm calculates the folding 
with the maximum number of base pairs. 
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Figure 7: Recursion for vx truncated at 0(2). 

The next order of complexity we explore corresponds to a truncation at ISs of 
0(2). Hairpin loops, bulges, stems, and internal loops are treated with precision 
by the scoring functions IS^ and IS^. The rest of ISs, collected under the name 
of "multiloops" — which are much less frequent than the previous — are described 
in an approximate form. The diagrams of this approximation are given in fig. 0, 
and correspond to, 



vx{i,j) — optimal 



IS^{i,j : k,l)+vx{k,l) 

Pi + M + wxi{i + 1, fc) + wxi{k + 1, j - 1) 

[Vfc, I i<k<l<j] 



(4) 



This is the algorithm described by Sankoff (1985) in theorem 3. This is the 
approximation that MFOLD (Zuker, 1981) and VicnnaRNA (Schuster et ai, 
1994) implement. Pi stands for the scoring parameter for a pair in a multiloop, 
and parameter M stands for the score for a multiloop. Note that the matrix 
wxj used to truncate the recursion for vx in (^ does not have to be the same as 
the one used in (|l|). Matrix wxj is used exclusively for diagrams which will be 
incorporated into multiloops. Although both matrices wx and wxj have similar 
recursions, the parameters of these two recursions will have in general different 
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values (represent here by P, Q, i, R and Pj, Qi, Lj, Rj respectively). This 
feature is implemented both in MFOLD and in our program. 

Higher orders of specificity of the general algorithm are possible, but are 
certainly more time consuming, and they have not been explored so far. One 
reason for this relative lack of development is that there is little information 
about the energetic properties of multiloops. The generalized nested algorithm 
provides a way to unify the currently available dynamic algorithms for RNA 
folding. At a given order, the error of the approximation is given by the differ- 
ence between the assigned score to "multiloops" and the precise score that one 
of those higher order ISs deserves. 

2.3 Description of the pseudoknot algorithm 

Pseudoknots are non-nested configurations and clearly cannot be described with 
just the wx and vx matrices we introduced in the previous section. The key point 
of the pseudoknot algorithm is the use of gap matrices in addition to the wx 
and vx matrices. Looking at the graphical representation of one of the simplest 
pseudoknots, fig. ^ we can see that we could describe such a configuration by 
putting together two gap matrices with complementary holes. 

The pseudoknot dynamic programming algorithm uses one-hole or gap ma- 
trices (fig. ^ as a generalization of the wx and vx matrices. Let us define 
whx{i,j : k,l) as the graph that describes the best folding that connects seg- 
ments [i, k] with j], i < k < I < j such that the relation between i and j and 
k and / is undetermined. Similarly we define vhx{i,j : k,l) as the graph that 
describes the best folding that connects segments [i, k] with [I, j], i < k < I < j 
such that i and j are base-paired and k and / are also base-paired. For com- 
pleteness we have to introduce also matrix yhx{i,j : k,l) in which k and / are 
paired, but the relation between i and j is undetermined, and its counterpart 
zhx{i,j : k,l) in which i and j are paired, but the relation between k and I is 
undetermined. 



matrix 
{i<k<l<j) 


relationship 
h j 


relationship 
fc, I 


wx{i,j) 
vx{i,j) 


undetermined 
paired 




whx(i, j : fc, I) 
vhx{i, j : k, I) 
zhx{i, j : fc, I) 
yhx{i,j : k,l) 


undetermined 

paired 

paired 
undetermined 


undetermined 

paired 
undetermined 

paired 



Table 1: Specifications of the matrices used in the pseudoknot algorithm. 

The non-gap matrices wx, vx are contained as a particular case of the gap 
matrices. When there is no hole, k = I — 1, then by construction, 

whx{i, i : k,k + \) = wx{i,j) (5) 
zhx{i, j : k, k + I) = vx{i,j) Vfc, i<k<j. 

We have the gap matrices as the building blocks of the algorithm, but how 
do we establish a consistent and complete recursion relation? Here is where the 
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Figure 8: Construction of a simple pseudoknot using two gap matrices. 
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analogy between the gap matrices and the Feynman diagrams of quantum field 
theory was of great help (Bjorken & DrcU 1965)]] 

Let us start with the generalization of the recursions for wx and vx in the 
presence of gap matrices. A non-gap matrix can be obtained by combining two 
gap matrices together, therefore the recursions for wx and vx add one more dia- 
gram with two gap matrices to recursions and (^. Again the diagrammatic 
representation (fig. ^) is more helpful than words in explaining the recur- 
sion. Note that the new term introduced in both recursions involves two gap 
matrices. In fact the recursion is an expansion in the number of gap matrices 
necessary at each step of the recursion. 



wx(i,f) — optimal < 



P + vx{i,j) 

Ll+i,j-i + R^+i.j-i +P + + 1, i ~ 1) 

Ll_^^^^+P + vx{i + lJ) 
Rlj^,+P + vx{i,j-l) 
Q + wx{i + 1, j) 
Q + wx{i,j — i) 
wx{i, k) + wx{k + 1, i) 

Gw + whx{i, r : k,l) + whx{k -|- 1, j : / — 1, r + 1) 
0{whx whx ^- whx) 



(6) 



Where Gw denotes the score for introducing a pseudoknot. 
Similarly for vx, 

IS^{i,j : k,l)+vx{k,l) 
Pi + M + wxi{i + l,k) + wxi{k + l,j - 1) 
Pi + M + Gwi + whx{i + l,r : k,l) 

+whx{k + -l:l-l,r + l) 
0{whx + whx + whx) 



vx{i,j) = optimal < 



(7) 



More precisely, the analogy is more cleanly expressed in terms of Schwinger-Dyson dia- 
grams which in QFT are used to represent full interacting vertices and propagators recursively 
in terms of elementary interactions. 
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Figure 10: Recursion for wx in the pseudoknot algorithm truncated at 0{whx~ 
whx + whx). 



[rfi, k,l,r, j i<k<l<r< j] 

Here M stands for a generic score for generating a non-nested multiloop, and 
Gwi stands for the score for generating an internal pseudoknot. 

Practical considerations make us truncate the expansion at this point, so we 
will not include diagrams that require three or more gap matrices. This state- 
ment should not mislead one into thinking that we cannot deal with complicated 
pseudoknots. The recursive nature of the approximation allows us to describe 
overlapping pseudoknots (defined as those pseudoknots for which a planar rep- 
resentation does not require crossing lines) as well as non-planar pseudoknots 
(for which a planar representation requires crossing lines). The Escherichia 
coli a mRNA presented by Gluick et al. (1994) is an example of a non-planar 
RNA pseudoknot that can be parsed using the pseudoknot algorithm. However 
the algorithm is not able to find all possible knotted configurations (fig. [T^ ). 
Nevertheless, the approximation seems to be adequate for the currently known 
pseudoknots in RNA folding. 

Note that two approximations are involved in the algorithm. Apart from that 
just mentioned (how to truncate the infinite expansion in gap matrices to make 
the algorithm polynomial) , we also use the approximation previously introduced 
for the nested algorithm (that IS's of O > 2 or multiloops are described in some 
approximated form). 

The algorithm is not complete until we provide the full recursive expressions 
to calculate the gap matrices. For a given gap matrix, we have to consider 
in how many different ways that diagram can be assembled using one or two 
mat rices at a time. The full description of those diagrams is given in Subsection 
5.2 . (Again Feynman diagrams are of great use here.) 
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Figure 11: Recursion for vx in the pseudoknot algorithm truncated at 0{whx+ 
whx + whx). 



2.4 Coaxial stacking 

It is quite frequent in RNA folding to create a more stable configuration when 
two independent configurations stack coaxially. That occurs for instance, when 
two hairpin loops with their respective stems are contiguous, so that one of them 
can fall on top of the other creating a more stable configuration than when the 
two hairpins just coexist without interaction of any kind. 

The algorithm implements coaxial energies for both nested and non-nested 
structures. We adopt the coaxial energies provided by Walter et al. (1994) for 
coaxial stacking of nested structures. For coaxial stacking of non- nested struc- 
tures we multiply these previous energies by an estimated (ad hoc) weighting 
parameter g <1. 

Using our diagrammatic representation it is possible to be systematic in 
describing the possible coaxial stacking that can occur. In the general recur- 
sion one has to look for contiguous nucleotides and allow them to be explicitly 
paired — but not to each other. This is best understood with an example. Con- 
sider the recursion for wx in fig. |lO[ in particular the bifurcation diagram 

wxli,i) — > wx{i,k) + wx{k + 1, i), yk,i<k<j. 

In order to allow for the possibility of coaxial stacking such diagram has to be 
complemented with another one in which the nucleotides of the bifurcation are 
base-paired 

wx{i,j) — > vx{i,k) + vx{k + + C{k,i : k + ^k,i<k<j. 
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Figure 12: Top: The non-planar pseudoknot presented in a mRNA and how 

to build it with gap matrices. The Roman numbers correspond to the num- 
bering of stems introduced by Gluick et al. (1994). Bottom: An example of a 
pseudoknot that the algorithm cannot handle: interlaced interactions as seen in 
proteins in parallel /3-shcct. The assembly of this interaction using gap matrices 
would require us to use four gap matrices at once which is not allowed by the 
approximation at hand. 
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This new diagram indicates that if nucleotides k and k + 1 are paired to nu- 
cleotides i and j respectively, that configuration is specially favored by an 
amount C(k,i : k + (presumably negative in energy units) because both 
sub-structures, vx{i, k) and vx{k -\- 1, j), will stack onto each other. 

Notice also that the new diagram really corresponds to four new diagrams 
because once we allow pairing, dangling bases have also to be considered, so the 
full nearest-neighbour interaction is taken into account. For purposes of clarity 
we will not explicitly specify any of the extra diagrams with dangling involved. 
The rest of the additional diagrams to be include d in the recursions to take care 
of coaxial stackings are also given in Subsection 5^ along with the full set of 
diagrams. Coaxial diagrams can be recognized by the empty dots representing 
the contiguous coaxially-stacking nucleotides. 



2.5 Minimum-energy implementation. Thermodynamic 
parameters 

We have implemented the pseudoknot algorithm using thermodynamic param- 
eters in order to fill the scoring matrices, both gapped and ungapped. For 
the relevant nested structures: hairpin loops, bulges, stems, internal loops and 
multiloops we have used the same set of energies as used in MFOLD.^ Free en- 
ergies for coaxial stacking C where obtained from Walter et at, (1994). Table || 
provides a list of the parameters used for nested conformations. 



Symbol 


Scoring parameter for 


Value(Kcal/mol) 


IS' 


hairpin loops 


varies 


IS'' 


bulges, stems and int loops 


varies 


c 


coaxial stacking 


varies 


p 


external pair 





Q 


single stranded base 





R,L 


base dangling off an external pair 


dangle + Q 


Pi 


pair in a multiloop 


0.1 


Qi 


not paired base inside multiloop 


0.4 


Ri, Lj 


base dangling off a multiloop pair 


dangle + Qi 


M 


nested multiloop 


4.6 



Table 2: This table includes all the parameters for which there is thermody- 
namic information provided by the Turner group. This parameters are identical 



to those used in MFOLD ( ht tp : / / www . ibc . wust 1 . edu / ~ zuker /rna ) 



For the non-nested configurations, there is not much thermodynamic infor- 
mation available (Wyatt et ai, 1990; Gluick et ai, 1994). This is not an untyp- 
ical situation; there is already very little thermodynamic information available 
for regular multiloops, let alone for pseudoknots. We had to tune by hand the pa- 
rameters related to pseudoknots. For some non-nested structures we multiplied 
the nested parameters by an estimated weighting parameter g < 1. It would be 

^ Since the implementation of the pseudoknot algorithm the Turner group has produced a 
new complete and more accurate list of parameters (Matthews et al., 1998) which we have 
not yet implemented. 
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very useful, in order to improve the accuracy of this thermodynamic implemen- 
tation of the pseudoknot algorithm, to have more accurate experimentally-based 
determinations of these parameters. Table ^ provides a list of the parameters 
we used for pseudoknot-related conformations. 



Symbol 


Scoring parameter for 


Value (Kcal/mol) 


~'i 
IS 


IS^ in a gap matrix 


IS^ * .g(0.83) 


c 


coaxial stacking in pseudoknots 


G*g 


p 


pair in a pseudoknot 


0.1 


Q 


not paired base in pseudoknot 


0.2 


R,L 


base dangling off a pseudoknot pair 


dangle * g + Q 


M 


non-nested multiloop 


8.43 


Gw 


generating a new pseudoknot 


7.0 


Gwi 


generating a pseudoknot in a multiloop 


13.0 


Gwh 


overlapping pseudoknots 


6.0 



Table 3: In this table we introduce the new thermodynamic parameters specific 
for pseudoknot configurations what we had to estimate. 



3 RESULTS 

The main purpose of this paper is to present an algorithm that solves opti- 
mal pseudoknotted RNA structures by dynamic programming. RNA structure 
prediction of single sequences with nested algorithms already involves some ap- 
proximation and inaccuracy (Zuker, 1995; Huynen et al, 1997). We expect this 
inaccuracy to increase in our case since the algorithm now allows a much larger 
configuration space. Therefore our limited objective here is to show that on 
a few small RNAs that are thought to conserve pseudoknots, our program (a 
minimal-energy implementation of the pseudoknot algorithm using a thermo- 
dynamic model) will actually find the pseudoknots; and for a few small RNAs 
that do not conserve pseudoknots, our program finds results similar to MFOLD, 
and does not introduce spurious pseudoknots. 

3.1 tRNAs 

Almost all transfer RNAs (tRNA) share a common cloverleaf structure. We 
have tested the algorithm on a group of 25 tRNAs selected at random from the 
Sprinzl tRNA database (Steinberg et ai, 1993). The program finds no spurious 
pseudoknot for any of the tested sequences. All tRNAs fold into a cloverleaf 
configuration but one (DT5090). Of the 24 cloverleaf foldings, 15 are completely 
consistent with their proposed structures (that is, each helical region has at least 
3 base pairs in common with its proposed folding). The remaining 9 cloverleaf 
foldings misplace one (6 sequences) or two (3 sequences) of the helical regions. 
On the other hand, MFOLD's lowest-energy prediction for the same set of tRNA 
sequences includes only 19 cloverleaf foldings of which 14 are completely consis- 
tent with their proposed structures. Performance for our program is therefore 
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at least comparable to MFOLD; the inaccuracies found are the result of the ap- 
proximations in the thermodynamic model, not a problem with the pseudoknot 
algorithm per se. The relevant result in relation to the pseudoknot algorithm is 
that its implementation predicts no spurious pseudoknots for tRNAs. 

One should not think of this result as a trivial one, because when knots 
are allowed, the configuration space available becomes much larger than the 
observed class of conformations. This problem is particularly relevant for "maxi- 
mum-pairing-like" algorithms, such as the MWM algorithm presented by Gary 
& Stormo (1995) or a Nussinov implementation of our pseudoknot algorithm 
(fig H). In both cases, the result is almost universal pairing because there is 
enough freedom to be able to coordinate any position with another one in the 
sequence. 

Another important aspect of tRNA folding is coaxial energies. Most tRNAs 
gain stability by stacking coaxially two of the hairpin loops, and the third one 
with the acceptor stem. This aspect of tRNA folding is very important and in 
some cases crucial to determine the right structure. There are situations like 
tRNA DA0260 in which MFOLD does not assign the lowest energy to the cor- 
rect structure (the MFOLD 3.0 prediction for DA0260 misses the acceptor stem, 
and has a free energy of —22.0 Kcal/mol). Our algorithm, on the other hand, 
implements coaxial energies; as a result the cloverleaf configuration becomes the 
most stable folding for tRNA DA0260 (AG = -24.3 Kcal/mol). The implemen- 
tation of coaxial energies explains why we found more cloverleaf structures for 
tRNAs than MFOLD does. 

3.2 HIV-l-RT-ligand RNA pseudoknots 

High-afhnity ligands of the reverse transcriptase of human immunodeficiency 
virus type 1 (HIV-1) isolated by a SELEX procedure by Tuerk et al. (1992) seem 
to have a pseudoknot consensus secondary structure. These oligonucleotides 
have between 34 and 47 bases and fold into a simple pseudoknot. Of a total of 63 
SELEX-selected pseudoknotted sequences available from Tuerk et al. (1992), we 
found 54 foldings that agreed exactly with the structures derived by comparative 
analysis. (AG = —10.9 Kcal/mol for sequence pattern I (3-2)). As expected, 
MFOLD predicts only one of the two stems (AG = —7.5 Kcal/mol for the same 
sequence) . 

3.3 Viral RNAs 

Some virus RNA genomes [such as turnip yellow mosaic virus (TYMV) (Guil- 
ley et al, 1979)] present a tRNA-like structure at their 3' end that includes a 
pseudoknot in the aminoacyl acceptor arm very close to the 3' end (Kolk et al, 
1998; Florentz et al., 1982; Dumas et al, 1987). Our program predicts correctly 
the TYMV tRNA-like structure with its pseudoknot for the last 86 bases at the 
3' end with AG = -30.4 Kcal/mol (the MFOLD 3.0 prediction for TYMV has 
a free energy of AG = —28.9 Kcal/mol). The tRNA-like 3' terminal structure 
is conserved among tymoviruses and also for the tobacco mosaic virus cowpea 
strain (CcTMV) another valine acceptor. Of the seven valine- acceptor tRNA- 
like structures proposed to date (Van Belkum et al, 1987) we reproduce six of 
them, except for kennedya yellow mosaic virus (KYMV) . 
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Finally wc have considered the last 189 bases of the 3' terminal of the To- 
bacco mosaic virus (TMV) (Van Belkum et al, 1985). TMV also has a tRNA-like 
structure at the end, but it may have additional upstream pseudoknots, up to a 
total of five, forming a long quasi-continuous helix. We folded the upstream and 
downstream regions separately in a piece of 84 nucleotides (the folding requires 
52 minutes and 9.8 Mb) and 105 nucleotides (the folding requires 246 minutes 
and 22.5 Mb) respectively. Our program predicts the 105 nucleotides down- 
stream region exactly with AG = —32.5 Kcal/mol. For the 84 nucleotides up- 
stream region we find four of the five helical regions with AG = —19.0 Kcal/mol. 

4 DISCUSSION 

In this paper we present an algorithm able to predict pseudoknots by dynamic 
programming. This algorithm demonstrates that using certain approximations 
consistent with the accepted Turner thermodynamic model, the prediction of 
pseudoknotted structures is a problem of polynomial complexity (although ad- 
mittedly high) . Having an optimal dynamic programming algorithm will enable 
extending other dynamic programming based methods that rigorously explore 
the conformational space for RNA folding (McCaskill, 1990; Bonhoeffer et al., 
1993) to pseudoknotted structures. 

Apart from the usefulness of the algorithm in predicting pseudoknots, we 
also include coaxial energies (when two stems stack coaxially), a very common 
feature of RNA folding. We expect MFOLD will also include coaxial energies 
in the near future (Matthews et al., 1998). 

Our algorithm is presented in the context of a general framework in which 
a generic folding is expressed in terms of its elementary secondary interactions 
(which we have identified as the irreducible surfaces). This is a further general- 
ization of Sankoff 's result (1985). The calculation of an optimal folding becomes 
an expansion in ISs of increasingly higher order. Our formalization incorporates 
all current dynamic programming RNA folding algorithms in addition to our 
pseudoknot algorithm. It also establishes the limitations of each approximation 
by determining at which order the expansion is truncated. 

As for the thermodynamic implementation presented in this paper, one of 
our major problems is the almost complete lack of thermodynamic information 
about pseudoknot configurations. The thermodynamic algorithm is also sen- 
sitive to the accuracy of the existing thermodynamic parameters. We expect 
to improve this aspect by implementing the more complete set of parameters 
provided by the Turner group (Matthews et al., 1998). 

The principal drawback is the time and memory constraints imposed by 
the computational complexity of the algorithm. At this early stage, we cannot 
analyze sequences much larger than 130-140 bases. For now the program is 
adequate for folding small RNAs. A 100 nucleotide RNA takes about 4 hours 
and 22.5 MB to fold on an SGI RIOK Origin200. 

Due to practical limitations, at a given point in the recursion we only allow 
the incorporation of two gap matrices. However, since each of those gap matrices 
can in turn be assembled by other two of those matrices, it implies that the 
algorithm includes in its configuration space a large variety of knotted motifs. 
The limitations of this truncation appeared when we considered applying this 
approach to describe pairwise residue interactions in protein folding. A parallel 
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/9-sheet configuration provides an example of a complicated knotted folding 
that cannot be handled by the pseudoknot algorithm presented in this paper. 
However, all known RNA pseudoknots can be handled by the algorithm, which 
makes the approximation useful enough for RNA secondary structure. 

Although we implemented the algorithm for energy minimization, extending 
MFOLD to pseudoknotted structures, the algorithm is not limited to energy 
minimization. Our algorithm can be converted into a probabilistic model for 
pseudoknot-containing RNA folding. Probabilistic models of RNA secondary 
structure based on "stochastic context free grammar" (SCFG) formalisms (Eddy 
et ai, 1994; Sakakibara et at, 1994; Lefebvre, 1996) have been introduced both 
for RNA single-sequence folding and for RNA structural alignment and struc- 
tural similarity searches. The Inside and CYK dynamic programming algo- 
rithms used for SCFG-based structural alignment are fundamentally similar to 
the Zuker algorithm (Durbin et ai, 1998), and have consequently also been 
unable to deal with pseudoknots. Heuristic approaches to applying SCFG-like 
structural alignment models to pseudoknots have been introduced (Brown, 1996; 
Notredame et ai, 1997). An SCFG-like probabilistic version of our pseudo- 
knot algorithm could be designed to obtain optimal structural alignment of 
pseudoknot-containing RNAs. 

5 METHODS 

5.1 Implementation 

The algorithm was implemented on a Silicon Graphics Origin200. The algorithm 
has a theoretical worst-case complexity of 0{N^) in time and 0{N'^) in storage. 
At its present stage, the program is empirically observed to run 0{N^'^) in 
time and 0{N^^^) in memory. For instance, a tRNA of 75 nucleotides takes 
24 minutes and uses 6.6 Mb of memory. The 3' end of tobacco mosaic virus 
has 105 nucleotides and takes 246 minutes and uses 22.5 Mb. The program 
empirically scales above the theoretical complexity in time of the algorithm. 
This effect seems to have to do with the way the machine allocates memory for 
larger RNAs. The software and parameter sets are available by request from E. 
Rivas (elena@genetics.wustl.edu). 

5.2 Complete set of diagrams for the pseudoknot algo- 
rithm 

In this section we provide the complete recursion relations for all the matrices 
used in the pseudoknot algorithm. 

The recursion for the non-gap matrix wx is given by (cf. fig. [T^): 
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p 



vx{i,j) 



Ll^,,j+P + vx{i + l,j} 
Pij-i+P + vxii,j -1) 

Q + wx{i + 
Q + wx{i,j - i) 

wx{i, k) + wx{k + 1, i) 

C{k, i : fc + 1, j) + vx{i, k) + vx{k + 1, j) 

Gw + whx{i, r : k,l) + whx{k + 1, j : / — 1, r 
2 * P + + C{1 - l,r + 1 : l,k) 

+yhx{i, r : k,l) + yhx{k + 1, j : / — 1, r + 



wx{i,j) — optimal < 



1) 



paired 
dangles 



strfrfded 



lested . 
jiiurcations 



1 on-nested 
jiiurcations 



(8) 

Here G^ denotes a score for introducing a pseudoknot, and C{1, r : I + 1, k) is a 
special score for a coaxial stacking of pairs (/, r) and {I + 1, k). 

We should also remember that the algorithm uses two different wx matrices 
depending on whether the subset is free-standing (wx) or appears inside a 
multiloop (in which case we use wxi). Both recursions are identical apart from 
having different coefficients as described in table |2[ 

The recursion for the non-gap matrix vx is given by (cf. fig. |lj): 



vx{i,j) = optimal < 



IS^{i,j : k,l)+vx{k,l) 

Pi + M + wxj{i + l,k)+ wxi{k + 1, j - 1) 

2 * Ff + C{i,j : i + l,k) + M + vx{i + l,k) + wxi{k + 1, j - 1) 

2 * Pf -t- c{j - 1, fc + 1 : j, i) + M + wxi{i + 1, fc) + vx{k + 1, j - 1) 

?,*Pi + C{k,i' : k + l,j')+M + vx{i',k) + vx{k + l,j') 

Pi + M + Gnii + whx{i + l,r -.kj) 

+whx{k + l,j - 1 : I - l,r + 1) 
2*Pi + M + G^i + G{i,j,:i + l,r) 

+zhx{i + l,r : k,l) + whx{k + 1, j — 1 : i — 1, r + 1) 
2*Pi + M + G^i + C{j -l,k + l: j, i) 

+whx{i + l,r : k,l) + zhx{k + l,j — 1:^ — l,r + l) 
?,*Pi + M + G„i + G{l-l,r+l ■.l,k) 

+yhx{i + l,r : k,l) + yhx{k + l,j - 1 : I - l,r + I) 



IS(1) 
IS(2) 



nested 
multiloops 



(9) 



[\^i,i',k,l,r,j',j i<i' <k<l<r<f <]] 

Parameters P/, M, and Gwi are defined in table |[ 
The initialization conditions are 



non-nested 
multiloops 



wx{i,i) = 0, 
vx{i,i) — +00. 



(10) 
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Figure 13: Recursion for the wx matrix. 
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[Vi l<i<N] 

The recursion for the vhx matrix in the pseudoknot algorithm is given by 
(cf. fig. |l|): 



vhx{i,j : k,l) = optimal < 



IS : k,l) 

-2 

IS : r, s) + vhx{r, s : k,l) 
— 2 

IS (r, s : k,l) + vhx{i, j : r, s) 

2*P + M + whx{i + l,j -l:k-lj + l) 



(11) 



[Vi, r,k,l, s, j i<r<k<l<s<j] 

Here M correspond to the score given to a multiloop in a vhx gap matrix. It 

could be equal to M, the score defined for multiloops in the vx matrix, but it 

— 2 

does not have to be. Similarly, the score for an irreducible surface of 0{2), IS , 
could be the same as the one given for nested structures, IS^, but again, it does 
not have to be. We found the best fits by giving them values different to the 
ones used for nested foldings (cf . table || and table ^) . 

The recursions for the gap matrices zhx and yhx in the pseudoknot algorithm 
are complementary and given by (cf. fig. and fig. 



zhx{i,j : k,l) — optimal < 



P + vhx{i, j : fc, I) 

L + R + P + vhx{i,j : k~ 1,1 + 1) 
R + P + vhx{i,j -.k-lj) 
L + P + vhx{i,j -.kj + l) 

Q + zhx{i, j : k — 1,1) 
Q + zhx{i,j : k,l + 1) 

zhx{i,i : r,l) -\- wxi{r + 1, fc) 

2 * P + C{r, I -.r + l,k) + vhx(i,j ■■r,l) + vx{r + l,k) 
zhx{i, j : k,s) + wxjil, s — 1) 

2 * P + C{s — 1,1 : s,k) + vhx{i, j : k, s) + vx{l, s — 1) 
— -2 

IS (i, j : r,s) + zhx{r, s : k,l) 
P + M + whx{i + l,j - 1 : k,l) 

(12) 



paired 
dangles 



single 
stranded 



nested ^. 
bifurcations 
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yhx{i,j : k,l) = optimal < 



P + vhx{i, j : k, I) 

L + R + P + vhx(i + -1 : k,l) 
L + P + vhx{i + l,j : k,l) 
R + P + vhx{i,j ~1 : k,l) 

Q + yhx{i + 1, j : k, I) 
Q + yhx{i, j — 1 : k,l) 

wxi{i, r) + yhx{r + l,j : k,l) 

2* P + C{r, i : r + + vx{i, r) + vhx{r + l,j : k, I) 
yhx{i, s : k,l) + wxj{s + 1, j) 

2*P + C{s, i: s + l,j)) + vhx{i, s : k,l)+ vx{s + 1, j) 

2 

yhx{i, j : r, s) + IS (r, s : k,l) 
P + M + whx{i,j : k~ 1,1 + 1) 



(13) 

[Vi, r,k,l, s, j i<r<k<l<s<j] 
Finally, the recursion for the gap matrix whx appears in fig. nSL and is given 



paired 
dangles 

single, , 
stranded 

nested ^. 
biiurcations 



by: 
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whx{i,j : k,l) = optimal < 



2 * P + vhx{i,j : k, I) 

P + zhxii, j : fc, I) 
P + yhx{i, j : k, I) 

L + R + 2*P + vhx{i + l,j -.k-lj) 

L +J{ + 2*P + vhx{i,j -l:k,l + l) 

2*1 + 2* P^+vhx{i + l,j -.kj + l) 

2*R + 2 *P + vhx{i,j -1: k-l,l) 

L + 2*R + 2 *P + vhx{i + l,j - 1 : k - 1,1) 

2*L + R + 2 *P + vhx{i + ■.k-l,l + 1) 

2^*L + R + 2*P + vhx{i + l,j -1 :k,l + l) 

L + 2*R + 2 *P + vhx{i,j -l:k- 1,1 + 1) 

2 *Zj-2 + 2 *P + +vhx{i + l,j -l:k-l,l + l) 

L + R + P + zhx{i + l,j -1: k,l) 

1 + P + zhx{i + l,j : k,l) 

R + P + zhx{i,j -l:k,l) 

Z + R + P + yhx{i,j : k- 1,1 + 1) 

R + P + yhx{i,j : k-l,l) 

L + P + yhx{i,j ■.k,l + l) 

Q + whx{i + l,j : k,l) 

Q + whx{i, j — 1 : k, I) 
Q + whx{i, j : k — 1,1) 
Q + whx{i, j : k,l + 1) 

(14) 
( cont.) 



paired 



dangles 



single, , 
stranded 
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whx{i,j : k,l) = optimal < 



wxi{i, k) + wxi{l,j) 
wxi{i, r) + whx{r + l, j : k, I) 

2 * F + C(r, z : r + 1, j) + vx{i, r) + zhx{r + 1, j : k, I) 
whx{i,j : r, Z) + wxi{r + 1, k) 

2 *P + C{r,l : r + 1, fc) +yhx{i,j : r,l)+vx{r + l,k) 
whx{i, s : k,l) + wxi{s + 

2* P + C{s,i : s + l,j) + zhx{i, s : k,l) + vx{s + 1, j) 

whx{i, j : fc, s + 1) + wxi{l, s) 

2*P + C{s,l : s + l,k) + yhx{i,j : k, s + 1) + vx{l, s) 
yhx{i,j : r, s) + zhx{r, s : k,l) 
M + whx{i,j : r,s) + whx{r + l,s — 1 : k,l) 

Gwh + u!hx{i. s : rj.) + whx{r + 1, j : fc, s + 1) 
Gwh + whx{i, s' : fc, s) + whx{l, j : s — 1, s' + 1) 
2*P + G^h + Cis',i:s' + l,s-l) 

+zhx{i, s' : k,s) + yhx{l, j : s — l,s' + 1) 
2*P + G^h + Cis-l,s' + l:s,k) 

+yhx{i, s' : fc, s) + yhx(l,j : s — 1, s' + 1) 
Gwh + whx{r,j : r', I) + whx{i, fc : r — 1, r' + 1) 
2*P + G^h + C{r~l,r' + l:r,j) 

+zhx{r,j : r' , I) + yhx(i, fc : r — 1, r' + 1) 
2*P + Gwh + C{r',l:r' + l,r-l) 

+yhx{r, j : r',l) + yhx{i, fc : r — 1, r' + 1) 

(15) 

[\/i,r,r',kJ,s,s',j i<r<r'<k<l<s<s'<j] 

Here G^h stands for the score given for finding overlapping pseudoknots, 
that is pseudoknots that appear within already existing pseudoknots. 
The initialization conditions are 



nested ^. 
biiurcations 



noji-nested 
biiurcations 



whx{i,j : = +oo, 

vhx{i,j:k,k) = +oo, 

yhx(i, j : k,k) = +(X), 

whx{i,j:k,k) = whx{i,j:k,k+l)=wx{i,j), 

zhx{i,j:k,k) = zhx{i,j:k,k+l) = vx{i,j). 

[yi,kj l<i<k<j<N] 



(16) 
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Figure 18: Recursion for the whx matrix. 
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